{ "cells": [ { "cell_type": "markdown", "id": "respective-kennedy", "metadata": {}, "source": [ "# Chapter 3: Feature Extraction from Text Data" ] }, { "cell_type": "markdown", "id": "floating-dividend", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "id": "subsequent-forwarding", "metadata": {}, "source": [ "Feature extraction is a pivotal step in the text mining process. Essentially, it translates textual data into a numerical form so that machine learning models can understand. It is the bedrock of many natural language processing tasks. Sklearn provides a suite of tools to efficiently transform text data into a format suitable for machine learning. Through African-context examples, we've witnessed the versatility and applicability of these tools across various textual scenarios. As we venture into more advanced topics, mastering the basics of feature extraction remains paramount.\n", "\n", "This chapter offers an exploration into sklearn's text feature extraction techniques." ] }, { "cell_type": "markdown", "id": "advance-marks", "metadata": {}, "source": [ "**Learning Objectives:**" ] }, { "cell_type": "markdown", "id": "visible-colony", "metadata": {}, "source": [ "* **Understand Basic Text Representation:** Comprehend the necessity of converting textual data into numerical format for machine learning applications, and appreciate the significance of feature extraction in text mining.\n", "\n", "* **Master CountVectorizer:** Confidently utilize the CountVectorizer method to transform text documents into a matrix of token counts, distinguishing how individual words and tokens are represented in this format.\n", "\n", "* **Differentiate Vectorization Techniques:** Discern the differences between TfidfVectorizer and the combination of CountVectorizer with TfidfTransformer. Know when to apply each method based on the task at hand." ] }, { "cell_type": "markdown", "id": "boring-photograph", "metadata": {}, "source": [ "## Understanding Document-Term Matrix (DTM)" ] }, { "cell_type": "markdown", "id": "hungry-arthur", "metadata": {}, "source": [ "\n", "The Document-Term Matrix (DTM) is a matrix representation of the text dataset where each row corresponds to a document, and each column represents a term (typically a word), and each cell contains the frequency of the term in the document.\n", "\n", "Consider two sentences:\n", "1. \"I love machine learning.\"\n", "2. \"Learning machine algorithms is fun.\"\n", "\n", "The DTM for these sentences would have a row for each sentence and columns for each unique word.\n" ] }, { "cell_type": "markdown", "id": "following-bailey", "metadata": {}, "source": [ "## CountVectorizer" ] }, { "cell_type": "markdown", "id": "aggressive-commission", "metadata": {}, "source": [ "`CountVectorizer` turns text documents into a matrix of token counts. Each row will represent a document, and each column will represent a token (word), with the value indicating the count of the token in the respective document." ] }, { "cell_type": "code", "execution_count": 2, "id": "robust-coupon", "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "import pandas as pd\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "defensive-baking", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
algorithmsfunislearninglovemachine
0000111
1111101
\n", "
" ], "text/plain": [ " algorithms fun is learning love machine\n", "0 0 0 0 1 1 1\n", "1 1 1 1 1 0 1" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Constructing a DTM using the above example\n", "\n", "sample_sentences = [\"I love machine learning.\", \"Learning machine algorithms is fun.\"]\n", "\n", "vectorizer = CountVectorizer()\n", "X0 = vectorizer.fit_transform(sample_sentences)\n", "\n", "# Convert to a DataFrame for better visualization\n", "df = pd.DataFrame(X0.toarray(), columns=vectorizer.get_feature_names_out())\n", "df" ] }, { "cell_type": "code", "execution_count": 4, "id": "beginning-clock", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bustlingcairocapitalcityegyptheartiniskenyalagosnairobinigeriaofthe
000100001101011
110010011010100
201001101000011
\n", "
" ], "text/plain": [ " bustling cairo capital city egypt heart in is kenya lagos \\\n", "0 0 0 1 0 0 0 0 1 1 0 \n", "1 1 0 0 1 0 0 1 1 0 1 \n", "2 0 1 0 0 1 1 0 1 0 0 \n", "\n", " nairobi nigeria of the \n", "0 1 0 1 1 \n", "1 0 1 0 0 \n", "2 0 0 1 1 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs_1 = [\"Nairobi is the capital of Kenya.\", \n", " \"Lagos is a bustling city in Nigeria.\", \n", " \"Cairo is the heart of Egypt.\"]\n", "\n", "vectorizer = CountVectorizer()\n", "vectorizer_2 = CountVectorizer(stop_words='english')\n", "X1= vectorizer.fit_transform(docs_1)\n", "\n", "# Convert to a DataFrame for better visualization\n", "capitals_df = pd.DataFrame(X1.toarray(), columns=vectorizer.get_feature_names_out())\n", "capitals_df\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "planned-lambda", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bustlingcairocapitalcityegyptheartkenyalagosnairobinigeria
00010001010
11001000101
20100110000
\n", "
" ], "text/plain": [ " bustling cairo capital city egypt heart kenya lagos nairobi \\\n", "0 0 0 1 0 0 0 1 0 1 \n", "1 1 0 0 1 0 0 0 1 0 \n", "2 0 1 0 0 1 1 0 0 0 \n", "\n", " nigeria \n", "0 0 \n", "1 1 \n", "2 0 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorizer_2 = CountVectorizer(stop_words='english')\n", "X1= vectorizer_2.fit_transform(docs_1)\n", "\n", "# Convert to a DataFrame for better visualization\n", "capitals_df = pd.DataFrame(X1.toarray(), columns=vectorizer_2.get_feature_names_out())\n", "capitals_df" ] }, { "cell_type": "code", "execution_count": 6, "id": "adequate-forth", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['african' 'and' 'are' 'is' 'lessons' 'life' 'offer' 'proverbs' 'sayings'\n", " 'wealth' 'wisdom' 'wise']\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
africanandareislessonslifeofferproverbssayingswealthwisdomwise
0101000011001
1010011110010
2000100000110
\n", "
" ], "text/plain": [ " african and are is lessons life offer proverbs sayings wealth \\\n", "0 1 0 1 0 0 0 0 1 1 0 \n", "1 0 1 0 0 1 1 1 1 0 0 \n", "2 0 0 0 1 0 0 0 0 0 1 \n", "\n", " wisdom wise \n", "0 0 1 \n", "1 1 0 \n", "2 1 0 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs_2 = [\"African proverbs are wise sayings.\", \n", " \"Proverbs offer wisdom and life lessons.\", \n", " \"Wisdom is wealth.\"]\n", "\n", "X2 = vectorizer.fit_transform(docs_2)\n", "print(vectorizer.get_feature_names_out())\n", "capitals_df = pd.DataFrame(X2.toarray(), columns=vectorizer.get_feature_names_out())\n", "capitals_df\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "intelligent-drill", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2nd54africacontinentcountrieshasinisitkilimanjarolargestmountaintallestthe
001101100000000
110010001101001
200100011010111
\n", "
" ], "text/plain": [ " 2nd 54 africa continent countries has in is it kilimanjaro \\\n", "0 0 1 1 0 1 1 0 0 0 0 \n", "1 1 0 0 1 0 0 0 1 1 0 \n", "2 0 0 1 0 0 0 1 1 0 1 \n", "\n", " largest mountain tallest the \n", "0 0 0 0 0 \n", "1 1 0 0 1 \n", "2 0 1 1 1 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Token Patterns (extracting only words without numbers)\n", "\n", "vectorizer_3 = CountVectorizer(token_pattern=r'\\b\\w+\\b')\n", "docs_3 = [\"Africa has 54 countries.\", \n", " \"It is the 2nd largest continent.\", \n", " \"Kilimanjaro is the tallest mountain in Africa.\"]\n", "\n", "X3 = vectorizer_3.fit_transform(docs_3)\n", "\n", "facts_df = pd.DataFrame(X3.toarray(), columns=vectorizer_3.get_feature_names_out())\n", "facts_df\n" ] }, { "cell_type": "markdown", "id": "seeing-project", "metadata": {}, "source": [ "## TfidfVectorizer" ] }, { "cell_type": "markdown", "id": "tired-inside", "metadata": {}, "source": [ "`TfidfVectorizer` converts text documents into a matrix of token counts and then transforms this count matrix into a tf-idf representation. Tf-idf stands for \"Term Frequency-Inverse Document Frequency\". It's a way to score the importance of words (tokens) in the document based on how frequently they appear across multiple documents." ] }, { "cell_type": "code", "execution_count": 8, "id": "studied-tuner", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
africaandchangeequalityforfoughtfreedominleaderleadershipmandelanelsonprofoundsawsouthunderwas
00.3241240.0000000.0000000.0000000.0000000.0000000.0000000.4261840.4261840.0000000.2517110.4261840.0000000.0000000.3241240.0000000.426184
10.0000000.4323850.0000000.4323850.4323850.4323850.4323850.0000000.0000000.0000000.2553740.0000000.0000000.0000000.0000000.0000000.000000
20.2981740.0000000.3920630.0000000.0000000.0000000.0000000.0000000.0000000.3920630.2315590.0000000.3920630.3920630.2981740.3920630.000000
\n", "
" ], "text/plain": [ " africa and change equality for fought freedom \\\n", "0 0.324124 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "1 0.000000 0.432385 0.000000 0.432385 0.432385 0.432385 0.432385 \n", "2 0.298174 0.000000 0.392063 0.000000 0.000000 0.000000 0.000000 \n", "\n", " in leader leadership mandela nelson profound saw \\\n", "0 0.426184 0.426184 0.000000 0.251711 0.426184 0.000000 0.000000 \n", "1 0.000000 0.000000 0.000000 0.255374 0.000000 0.000000 0.000000 \n", "2 0.000000 0.000000 0.392063 0.231559 0.000000 0.392063 0.392063 \n", "\n", " south under was \n", "0 0.324124 0.000000 0.426184 \n", "1 0.000000 0.000000 0.000000 \n", "2 0.298174 0.392063 0.000000 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ " # Basic Tf-idf Scores\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "docs_4 = [\"Nelson Mandela was a leader in South Africa.\", \n", " \"Mandela fought for freedom and equality.\", \n", " \"South Africa saw profound change under Mandela's leadership.\"]\n", "\n", "\n", "vectorizer_4 = TfidfVectorizer()\n", "\n", "X4 = vectorizer_4.fit_transform(docs_4)\n", "\n", "sa_facts = pd.DataFrame(X4.toarray(), columns=vectorizer_4.get_feature_names_out())\n", "sa_facts\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "severe-external", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
africaandanimalsarebigbothcatscheetahselephantsfastestforfoundinknownlandlargelionsthetheirtusks
00.2674850.3517110.0000000.2077260.3517110.3517110.3517110.2674850.0000000.0000000.0000000.3517110.2674850.0000000.0000000.0000000.3517110.0000000.0000000.000000
10.2776010.0000000.0000000.2155820.0000000.0000000.0000000.0000000.3650110.0000000.3650110.0000000.2776010.3650110.0000000.3650110.0000000.0000000.3650110.365011
20.0000000.0000000.4505040.2660750.0000000.0000000.0000000.3426200.0000000.4505040.0000000.0000000.0000000.0000000.4505040.0000000.0000000.4505040.0000000.000000
\n", "
" ], "text/plain": [ " africa and animals are big both cats \\\n", "0 0.267485 0.351711 0.000000 0.207726 0.351711 0.351711 0.351711 \n", "1 0.277601 0.000000 0.000000 0.215582 0.000000 0.000000 0.000000 \n", "2 0.000000 0.000000 0.450504 0.266075 0.000000 0.000000 0.000000 \n", "\n", " cheetahs elephants fastest for found in known \\\n", "0 0.267485 0.000000 0.000000 0.000000 0.351711 0.267485 0.000000 \n", "1 0.000000 0.365011 0.000000 0.365011 0.000000 0.277601 0.365011 \n", "2 0.342620 0.000000 0.450504 0.000000 0.000000 0.000000 0.000000 \n", "\n", " land large lions the their tusks \n", "0 0.000000 0.000000 0.351711 0.000000 0.000000 0.000000 \n", "1 0.000000 0.365011 0.000000 0.000000 0.365011 0.365011 \n", "2 0.450504 0.000000 0.000000 0.450504 0.000000 0.000000 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs_5 = [\"Lions and cheetahs are both big cats found in Africa.\", \n", " \"Elephants in Africa are known for their large tusks.\", \n", " \"Cheetahs are the fastest land animals.\"]\n", "\n", "X5 = vectorizer_4.fit_transform(docs_5)\n", "\n", "\n", "sa_facts = pd.DataFrame(X5.toarray(), columns=vectorizer_4.get_feature_names_out())\n", "sa_facts\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "coated-assistant", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
africanafrican countriesandand diversecongocongo rainforestcountriescutscuts throughdesert...several africanthethe congothe nilethe saharathroughthrough severalvastvast andvast desert
00.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.375716...0.0000000.2219040.0000000.0000000.3757160.0000000.0000000.2857420.0000000.375716
10.2845690.2845690.0000000.0000000.0000000.0000000.2845690.2845690.2845690.000000...0.2845690.1680710.0000000.2845690.0000000.2845690.2845690.0000000.0000000.000000
20.0000000.0000000.3003660.3003660.3003660.3003660.0000000.0000000.0000000.000000...0.0000000.1774010.3003660.0000000.0000000.0000000.0000000.2284360.3003660.000000
\n", "

3 rows × 30 columns

\n", "
" ], "text/plain": [ " african african countries and and diverse congo \\\n", "0 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "1 0.284569 0.284569 0.000000 0.000000 0.000000 \n", "2 0.000000 0.000000 0.300366 0.300366 0.300366 \n", "\n", " congo rainforest countries cuts cuts through desert ... \\\n", "0 0.000000 0.000000 0.000000 0.000000 0.375716 ... \n", "1 0.000000 0.284569 0.284569 0.284569 0.000000 ... \n", "2 0.300366 0.000000 0.000000 0.000000 0.000000 ... \n", "\n", " several african the the congo the nile the sahara through \\\n", "0 0.000000 0.221904 0.000000 0.000000 0.375716 0.000000 \n", "1 0.284569 0.168071 0.000000 0.284569 0.000000 0.284569 \n", "2 0.000000 0.177401 0.300366 0.000000 0.000000 0.000000 \n", "\n", " through several vast vast and vast desert \n", "0 0.000000 0.285742 0.000000 0.375716 \n", "1 0.284569 0.000000 0.000000 0.000000 \n", "2 0.000000 0.228436 0.300366 0.000000 \n", "\n", "[3 rows x 30 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorizer_6 = TfidfVectorizer(ngram_range=(1,2))\n", "\n", "docs_6 = [\"The Sahara is a vast desert.\", \n", " \"The Nile cuts through several African countries.\", \n", " \"The Congo rainforest is vast and diverse.\"]\n", "\n", "X6 = vectorizer_6.fit_transform(docs_6)\n", "\n", "rivers_df = pd.DataFrame(X6.toarray(), columns=vectorizer_6.get_feature_names_out())\n", "rivers_df" ] }, { "cell_type": "markdown", "id": "manual-slide", "metadata": {}, "source": [ "## TfidfTransformer" ] }, { "cell_type": "markdown", "id": "twelve-stanley", "metadata": {}, "source": [ "While `TfidfVectorizer` takes in raw text and produces tf-idf scores, `TfidfTransformer` is used after `CountVectorizer` to convert the count matrix into a tf-idf representation." ] }, { "cell_type": "code", "execution_count": 11, "id": "first-wednesday", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
accraforghanagoldhistorichostshubisitsknownofresourcesseveralsitesthe
00.3494980.0000000.3494980.0000000.0000000.0000000.4595480.3494980.0000000.0000000.4595480.0000000.0000000.0000000.459548
10.0000000.4030160.3065040.4030160.0000000.0000000.0000000.3065040.4030160.4030160.0000000.4030160.0000000.0000000.000000
20.3554320.0000000.0000000.0000000.4673510.4673510.0000000.0000000.0000000.0000000.0000000.0000000.4673510.4673510.000000
\n", "
" ], "text/plain": [ " accra for ghana gold historic hosts hub \\\n", "0 0.349498 0.000000 0.349498 0.000000 0.000000 0.000000 0.459548 \n", "1 0.000000 0.403016 0.306504 0.403016 0.000000 0.000000 0.000000 \n", "2 0.355432 0.000000 0.000000 0.000000 0.467351 0.467351 0.000000 \n", "\n", " is its known of resources several sites \\\n", "0 0.349498 0.000000 0.000000 0.459548 0.000000 0.000000 0.000000 \n", "1 0.306504 0.403016 0.403016 0.000000 0.403016 0.000000 0.000000 \n", "2 0.000000 0.000000 0.000000 0.000000 0.000000 0.467351 0.467351 \n", "\n", " the \n", "0 0.459548 \n", "1 0.000000 \n", "2 0.000000 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.feature_extraction.text import TfidfTransformer\n", "\n", "docs_7 = [\"Accra is the hub of Ghana.\", \n", " \"Ghana is known for its gold resources.\", \n", " \"Accra hosts several historic sites.\"]\n", "\n", "\n", "count_vect = CountVectorizer()\n", "X7_count = count_vect.fit_transform(docs_7)\n", "\n", "tfidf_transformer = TfidfTransformer()\n", "X7_tfidf = tfidf_transformer.fit_transform(X7_count)\n", "\n", "rivers_df = pd.DataFrame(X7_tfidf.toarray(), columns=count_vect.get_feature_names_out())\n", "rivers_df\n" ] }, { "cell_type": "code", "execution_count": 12, "id": "subject-chester", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
africabiltongdisheastfoodfrominisjolloforiginatingpopularricesnacksouthstapleugaliwest
00.2357560.000000.3991690.0000000.0000000.000000.3035780.2357560.3991690.000000.3991690.3991690.000000.000000.0000000.0000000.399169
10.2571290.000000.0000000.4353570.4353570.000000.3311000.2571290.0000000.000000.0000000.0000000.000000.000000.4353570.4353570.000000
20.2474330.418940.0000000.0000000.0000000.418940.0000000.2474330.0000000.418940.0000000.0000000.418940.418940.0000000.0000000.000000
\n", "
" ], "text/plain": [ " africa biltong dish east food from in \\\n", "0 0.235756 0.00000 0.399169 0.000000 0.000000 0.00000 0.303578 \n", "1 0.257129 0.00000 0.000000 0.435357 0.435357 0.00000 0.331100 \n", "2 0.247433 0.41894 0.000000 0.000000 0.000000 0.41894 0.000000 \n", "\n", " is jollof originating popular rice snack south \\\n", "0 0.235756 0.399169 0.00000 0.399169 0.399169 0.00000 0.00000 \n", "1 0.257129 0.000000 0.00000 0.000000 0.000000 0.00000 0.00000 \n", "2 0.247433 0.000000 0.41894 0.000000 0.000000 0.41894 0.41894 \n", "\n", " staple ugali west \n", "0 0.000000 0.000000 0.399169 \n", "1 0.435357 0.435357 0.000000 \n", "2 0.000000 0.000000 0.000000 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs_8 = [\"Jollof rice is a popular dish in West Africa.\", \n", " \"Ugali is a staple food in East Africa.\", \n", " \"Biltong is a snack originating from South Africa.\"]\n", "\n", "\n", "X8_count = count_vect.fit_transform(docs_8)\n", "X8_tfidf = tfidf_transformer.fit_transform(X8_count)\n", "\n", "afrofoods_df = pd.DataFrame(X8_tfidf.toarray(), columns=count_vect.get_feature_names_out())\n", "afrofoods_df" ] }, { "cell_type": "code", "execution_count": null, "id": "spoken-cocktail", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 13, "id": "antique-moscow", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
acrossafricanafrobeatsandarecommoncontinentdiversefestivalsgenreshighlifeismusicpopularthe
00.000000.265920.0000000.0000000.0000000.000000.000000.265920.000000.0000000.0000000.265920.2022390.0000000.00000
10.000000.000000.1735950.1735950.1320240.000000.000000.000000.000000.1735950.1735950.000000.0000000.1735950.00000
20.153350.000000.0000000.0000000.1166260.153350.153350.000000.153350.0000000.0000000.000000.1166260.0000000.15335
\n", "
" ], "text/plain": [ " across african afrobeats and are common continent \\\n", "0 0.00000 0.26592 0.000000 0.000000 0.000000 0.00000 0.00000 \n", "1 0.00000 0.00000 0.173595 0.173595 0.132024 0.00000 0.00000 \n", "2 0.15335 0.00000 0.000000 0.000000 0.116626 0.15335 0.15335 \n", "\n", " diverse festivals genres highlife is music popular \\\n", "0 0.26592 0.00000 0.000000 0.000000 0.26592 0.202239 0.000000 \n", "1 0.00000 0.00000 0.173595 0.173595 0.00000 0.000000 0.173595 \n", "2 0.00000 0.15335 0.000000 0.000000 0.00000 0.116626 0.000000 \n", "\n", " the \n", "0 0.00000 \n", "1 0.00000 \n", "2 0.15335 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tfidf_transformer_9 = TfidfTransformer(norm='l1')\n", "docs_9 = [\"African music is diverse.\", \n", " \"Afrobeats and Highlife are popular genres.\", \n", " \"Music festivals are common across the continent.\"]\n", "\n", "X9_count = count_vect.fit_transform(docs_9)\n", "X9_tfidf = tfidf_transformer_9.fit_transform(X9_count)\n", "\n", "afrimusic_df = pd.DataFrame(X9_tfidf.toarray(), columns=count_vect.get_feature_names_out())\n", "afrimusic_df\n" ] }, { "cell_type": "markdown", "id": "female-james", "metadata": {}, "source": [ ">#### Task 7: \n" ] }, { "cell_type": "markdown", "id": "collective-wright", "metadata": {}, "source": [ "## Analyzing News Articles on African Youth Unemployment\n" ] }, { "cell_type": "markdown", "id": "conventional-interface", "metadata": {}, "source": [ "You're a sociologist who's investigating the portrayal of youth unemployment in African news media. You've collected several news articles discussing youth unemployment in various African nations. Your aim is to identify the most discussed themes and assess the importance of different terms in the articles using feature extraction methods.\n", "\n", "\n", "1. Load the youth employment articles using the following command `%load `youth_emp_article.py`\n", "\n", "2. Tokenize the articles into individual words.\n", "\n", "3. Use the CountVectorizer to count word occurrences.\n", "\n", "4. Use the TfidfVectorizer to compute the Term Frequency-Inverse Document Frequency (TF-IDF) values for each term.\n", "\n", "5. Alternatively, use the TfidfTransformer to compute TF-IDF values if starting with raw count from CountVectorizer.\n", "\n", "6. Analyze the top terms to understand the main themes in the articles." ] }, { "cell_type": "markdown", "id": "complete-ghana", "metadata": {}, "source": [ ">#### Task 8: \n" ] }, { "cell_type": "markdown", "id": "sunset-fraction", "metadata": {}, "source": [ "## Analyzing Economic Reports on the African Agricultural Export Potential" ] }, { "cell_type": "markdown", "id": "flying-drive", "metadata": {}, "source": [ "You're an economist at the African Union's Department of Economic Affairs. With increasing talks about intra-African trade and global exports, you've gathered several economic reports discussing the potential of African agricultural exports and their economic impact. Your goal is to extract insights about the most emphasized agricultural commodities and understand the most significant themes across the reports using text analysis techniques.\n", "\n", "1. Load the youth economics reports using the following command `%load eco_reports.py`\n", "\n", "2. Tokenize the economic reports into individual words.\n", "\n", "3. Use the CountVectorizer to compute the frequency of word occurrences.\n", "\n", "4. 
{ "cell_type": "markdown", "id": "complete-ghana", "metadata": {}, "source": [ ">#### Task 8: \n" ] }, { "cell_type": "markdown", "id": "sunset-fraction", "metadata": {}, "source": [ "## Analyzing Economic Reports on the African Agricultural Export Potential" ] }, { "cell_type": "markdown", "id": "flying-drive", "metadata": {}, "source": [ "You're an economist at the African Union's Department of Economic Affairs. With increasing talk about intra-African trade and global exports, you've gathered several economic reports discussing the potential of African agricultural exports and their economic impact. Your goal is to extract insights about the most emphasized agricultural commodities and to identify the most significant themes across the reports using text analysis techniques.\n", "\n", "1. Load the economic reports using the command `%load eco_reports.py`.\n", "\n", "2. Tokenize the economic reports into individual words.\n", "\n", "3. Use the CountVectorizer to compute the frequency of word occurrences.\n", "\n", "4. Apply the TfidfVectorizer to determine the Term Frequency-Inverse Document Frequency (TF-IDF) values for each term. Alternatively, if starting with raw counts from CountVectorizer, use the TfidfTransformer to calculate the TF-IDF values.\n", "\n", "5. Evaluate the top terms to decipher the primary commodities and themes in the economic reports. The Task 7 starter sketch above can be adapted directly.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "polar-serial", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 5 }